We propose a simple model for an atomic switch-based decision maker (ASDM), and show that, as 
long as its total volume of precipitated Ag atoms is conserved when coupled with suitable operations, 
an atomic switch system provides a sophisticated “decision-making” capability that is known to be 
one of the most important intellectual abilities in human beings. We considered the multi-armed 
bandit problem (MAB); the problem of finding, as accurately and quickly as possible, the most 
profitable option from a set of options that gives stochastic rewards. These decisions are made 
as dictated by each volume of precipitated Ag atoms, which is moved in a manner similar to the 
fluctuations of a rigid body in a tug-of-war game. The “tug-of-war (TOW) dynamics” of the ASDM 
exhibits higher efficiency than conventional MAB solvers. We show analytical calculations that 
validate the statistical reasons for the ASDM dynamics to produce such high performance, despite its 
simplicity. These results imply that various physical systems, in which some conservation law holds, 
can be used to implement efficient “decision-making objects.” Efficient MAB solvers are useful for 
many practical applications, because MAB abstracts a variety of decision-making problems in real-world 
situations where efficient trial and error is required. The proposed scheme will introduce 
a new physics-based analog computing paradigm, which will include such things as “intelligent 
nanodevices” and “intelligent information networks” based on self-detection and self-judgment. 
I. INTRODUCTION 

When we look at the natural world, information processing in biological systems is elegantly coupled with their 
underlying physics [1]. This suggests a potential for establishing a new physics-based analog-computing paradigm. 
A proposal was made about ten years ago for a conceptually novel switching device, called the “atomic switch,” 
that is based on metal ion migration and electrochemical reactions in solid electrolytes Q. Because its resistance 
state is controlled continuously by the movement of a limited number of metal ions/atoms, the atomic switch can be 
regarded as a physics-based analog-computing element. In this paper, using atomic switches, we show that a physical 
constraint, the volume conservation law, allows for the efficient solving of decision-making problems which, in human 
beings, is one of the most important intellectual abilities. 

Suppose there are $M$ slot machines, each of which returns a reward (for example, coins) drawn from a probability 
density function (PDF) that is unknown to the player. Let us consider a minimal case: two machines A and B give 
rewards with individual PDFs whose mean rewards are $\mu_A$ and $\mu_B$, respectively. The player decides which 
machine to play at each trial, trying to maximize the total reward obtained after repeating several trials. The 
multi-armed bandit problem (MAB) is to determine the optimal strategy for playing the machines as accurately and 
quickly as possible by referring to past experience. 

In the context of decision making algorithms, the MAB was originally described by Robbins Q, although the essence 
of the problem had been studied earlier by Thompson [5]. The optimal strategy, called the “Gittins index,” is known 
only for a limited class of problems in which the reward distributions are assumed to be known to the players [6]. 
Even in this limited class, in practice, computing the Gittins index becomes intractable for many cases. For the 
algorithms proposed by Agrawal and Auer et al., another index was expressed as a simple function of the reward sums 
obtained from the machines [7, 8]. In particular, the “upper confidence bound 1 (UCB1) algorithm” for solving MABs 
is used worldwide in many practical applications [9]. The MAB is formulated as a mathematical problem without loss 
of generality and, as such, is related to various stochastic phenomena. In fact, many application problems in diverse 
fields, such as communications (cognitive networks [10, 11]), commerce (advertising on the web [12]), and entertainment 
(Monte-Carlo tree search, which is used for computer games [13, 14]), can be reduced to MABs. 
II. MODEL 

Kim et al. proposed a MAB solution called “tug-of-war (TOW),” which uses a dynamical system. This algorithm 
was inspired by the spatiotemporal dynamics of a single-celled amoeboid organism (the true slime mold P. 
polycephalum) [15]-[21], which maintains a constant intracellular-resource volume while collecting environmental 
information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this bio-inspired algorithm, 
the decision-making function is derived from its underlying physics, which resembles that of a tug-of-war game. The 
physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a nonlocal 
correlation among the terminal parts. That is, the volume increment in one part is immediately compensated for by 
volume decrement(s) in the other part(s). In our previous studies [15]-[21], we showed that, owing to the nonlocal 
correlation derived from the volume-conservation law, TOW dynamics exhibit higher performance than other well-known 
algorithms such as the modified ε-greedy algorithm and the modified softmax algorithm, and performance comparable to that of the 
UCB1-tuned algorithm (seen as the best choice among parameter-free algorithms Q). These observations suggest 
that efficient decision-making devices could be implemented using any physical object as long as it holds some 
common physical attributes, such as the conservation law. In fact, Kim et al. demonstrated that optical energy-transfer 
dynamics between quantum dots, in which energy is conserved, can be exploited for the implementation of TOW 
dynamics [22]-[24]. 
FIG. 1: (a) The ASDM using gapless type atomic switches. The ASDM decides which machine (A and/or B) is to be played 
at time $t$ according to whether the current $I_k$ is larger than $\theta$ or not. (b) Voltages $V_A$ and $V_B$. Here, the added voltage $\Delta V_k(j)$ is 
determined by each reward $R_k(j)$ (Eq. (1)) at play $j$ ($I_k > \theta$). 
Here, we propose a simplified model for an atomic switch-based decision maker (ASDM). Consider two atomic 
switches located close to each other, in which a solid electrolyte (SE) is sandwiched between one top Pt electrode 
and two bottom Pt electrodes, one on each side, as shown in Fig. 1(a). Each atomic switch is operated in 
a metal/ionic conductor/metal configuration, which is referred to as a “gapless type atomic switch” [25]. Here we 
assume that the operation of each switch is influenced by the other, which implies a certain interaction between the 
two switches. In the initial state, Ag ions are distributed uniformly in the electrolyte. When a bias voltage of $-V_0$ is 
applied to the bottom Pt electrodes relative to the top Pt electrode, Ag ions migrate to the bottom electrodes and 
the same amount of Ag atoms is precipitated on each electrode. We denote the resulting height of precipitated Ag atoms by $X_0$, 
and the displacement of the height of precipitated Ag atoms from $X_0$ at time $t$ by $X_k(t)$ ($k \in \{A, B\}$). The total 
height is thus $X_0 + X_k(t)$. 

If the current $I_k > \theta$, we consider that the ASDM chooses machine $k$ and obtains a reward $R_k(j)$ generated from its 
“unknown PDF” (the mean reward $\mu_k$ is also supposed to be unknown). According to the reward, the added voltage 
$\Delta V_k(j)$ is determined by 

$$\Delta V_k(j) = R_k(j) - K. \tag{1}$$

Here, $R_k(j)$ is a “reward,” which can take an arbitrary real value, and $K$ is a parameter to be described in detail later in 
this paper. Then, each voltage becomes 

$$V_k = -(V_0 + \Delta V_k(j)). \tag{2}$$

We assume the following conditions: 

1. At the initial equilibrium state, the SE is nearly empty of Ag ions to be precipitated. This implies that an increment 
of one height is compensated by a decrement in the other (Eq. (3) holds). 

2. If $X_0 + X_k(t) > Th$, the current $I_k$ is larger than $\theta$. Here, $Th$ and $\theta$ are thresholds. If $Th$ is set to be smaller 
than $X_0$, the dynamics works from the initial state without fluctuations. 

3. For simplicity, we assume a linear dependence between $\Delta V_k$ and $\Delta X_k$ (Eq. (4)), even though it depends on the 
shape of the Ag atoms and the amount of Ag ions remaining. 

4. The time interval $\Delta t$ for adding voltage is sufficiently larger than the interval $\Delta t_{\mathrm{int}}$ that the decaying effect 
of Ag atoms during $\Delta t_{\mathrm{int}}$ can be ignored. 

The displacement $X_A$ ($= -X_B$) is determined by the following equations: 

$$X_A(t+1) = Q_A(t) - Q_B(t) + \delta(t), \tag{3}$$

$$Q_k(t) = \sum_{j=1}^{N_k} \Delta V_k(j). \tag{4}$$

Here, $Q_k(t)$ ($k \in \{A, B\}$) is an “estimate” of information of past experiences accumulated from the initial time 1 to 
the current time $t$, $N_k$ counts the number of times that machine $k$ has been played, $\Delta V_k(j)$ is the added voltage at the $j$-th play of 
machine $k$, $\delta(t)$ is an arbitrary fluctuation to which the body is subjected, and $K$ is a parameter. Eqs. (1) and 
(4) are called the “learning rule.” Consequently, the ASDM dynamics evolve according to a particularly simple rule: 
in addition to the fluctuation, if machine $k$ is played at time $t$, $R_k - K$ is added to $X_k(t)$ (Fig. 1). 
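The update rule of Eqs. (1), (3), and (4) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation code: the function name `run_asdm`, the Gaussian reward generator, the choice Th = X_0 (so that machine k is played when X_k > 0), the oscillating fluctuation, and the random tie-break are assumptions made here for illustration.

```python
import math
import random


def run_asdm(mu, sigma, K, n_plays, seed=0):
    """Simulate one ASDM run on two Gaussian machines (illustrative sketch).

    mu: dict of mean rewards, e.g. {"A": 0.5, "B": 0.6}.
    Returns (total_reward, play_counts).
    """
    rng = random.Random(seed)
    Q = {"A": 0.0, "B": 0.0}    # accumulated added voltages, Eq. (4)
    counts = {"A": 0, "B": 0}   # N_k: number of plays of machine k
    total = 0.0
    for t in range(n_plays):
        # Assumed fluctuation delta(t) = sin(pi/2 + pi*t), alternating +1, -1.
        delta = math.sin(math.pi / 2 + math.pi * t)
        x_a = Q["A"] - Q["B"] + delta   # Eq. (3); X_B = -X_A by conservation
        # Assuming Th = X_0, machine k is played when X_k > 0.
        if x_a > 0:
            k = "A"
        elif x_a < 0:
            k = "B"
        else:
            k = rng.choice(["A", "B"])  # tie-break: an assumption of this sketch
        reward = rng.gauss(mu[k], sigma)
        Q[k] += reward - K              # Eq. (1): Delta V_k(j) = R_k(j) - K
        counts[k] += 1
        total += reward
    return total, counts


if __name__ == "__main__":
    mu = {"A": 0.5, "B": 0.6}
    K = (mu["A"] + mu["B"]) / 2         # midpoint of the two means
    print(run_asdm(mu, sigma=0.2, K=K, n_plays=1000))
```

Because the only state is the pair of accumulated sums, each decision costs constant time and memory, which is what makes the rule attractive for a physical implementation.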


B. SOFTMAX Algorithm 


The SOFTMAX algorithm is a well-known algorithm for solving MAB problems [26]. In this algorithm, the 
probability of selecting A or B, $P'_A(t)$ or $P'_B(t)$, is given by the following Boltzmann distributions: 

$$P'_A(t) = \frac{\exp[\beta \cdot Q_A(t)]}{\exp[\beta \cdot Q_A(t)] + \exp[\beta \cdot Q_B(t)]}, \tag{5}$$

$$P'_B(t) = \frac{\exp[\beta \cdot Q_B(t)]}{\exp[\beta \cdot Q_A(t)] + \exp[\beta \cdot Q_B(t)]}, \tag{6}$$

where $Q_k(t)$ ($k \in \{A, B\}$) is given by $\frac{\sum_{j=1}^{N_k(t)} R_k(j)}{N_k(t)}$. Here, $\beta$ takes a time-dependent form in our study, as follows: 

$$\beta(t) = \tau \cdot t. \tag{7}$$

$\beta = 0$ corresponds to a random selection, and $\beta \to \infty$ corresponds to a greedy action. 
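The selection probabilities of Eqs. (5)-(7) can be sketched as follows. The function names are hypothetical, the subtraction of the maximum is a standard numerical-stability device rather than part of the algorithm, and returning 0 for a machine that has not been played yet is an assumption of this sketch.

```python
import math


def mean_estimate(rewards):
    """Q_k(t): average reward observed so far on machine k (0.0 if unplayed; assumed)."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def softmax_probs(q_a, q_b, tau, t):
    """Return (P'_A, P'_B) from Eqs. (5) and (6) with beta(t) = tau * t, Eq. (7)."""
    beta = tau * t
    m = max(q_a, q_b)                 # shift exponents so exp() cannot overflow
    e_a = math.exp(beta * (q_a - m))
    e_b = math.exp(beta * (q_b - m))
    z = e_a + e_b
    return e_a / z, e_b / z


if __name__ == "__main__":
    print(softmax_probs(0.5, 0.6, tau=0.3, t=0))     # beta = 0: random selection
    print(softmax_probs(0.5, 0.6, tau=0.3, t=1000))  # large beta: nearly greedy
```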
III. SIMULATION RESULTS 

FIG. 2: Performance comparison between the ASDM and SOFTMAX. We used $K_0 = 0.55$ for the ASDM and $\tau_{\mathrm{opt}} = 0.3$ for 
SOFTMAX. 

From computer simulations, we confirmed that, in almost all cases, an ASDM with the parameter $K_0$ ($= (\mu_A + \mu_B)/2$) 
can acquire more rewards than a SOFTMAX algorithm with the optimized parameter $\tau_{\mathrm{opt}}$, although SOFTMAX is 
well known as a good algorithm for efficient decision-making [27]. Figure 2 shows an example of an ASDM/SOFTMAX 
performance comparison. The vertical axis denotes the sum of acquired rewards (mean values of 1000 samples), and 
the horizontal axis denotes the number of plays. For the reward PDFs, we used normal distributions $N(\mu_A, \sigma^2)$ 
and $N(\mu_B, \sigma^2)$, where $\mu_A = 0.5$, $\mu_B = 0.6$, and $\sigma = 0.2$. Computer simulations were executed under the condition that 
$Th = X_0$ and $\delta(t) = \sin(\pi/2 + \pi t)$. 


IV. THEORETICAL ANALYSES OF THE ASDM 

Theoretical analyses of the TOW dynamics for a Bernoulli type MAB problem, in which a reward is limited to 0 
or 1, are described in [21]. In this section, theoretical analyses of the ASDM are described for a general MAB where 
a reward is not limited to 0 or 1. 


A. Solvability of the MAB 

To explore the MAB solvability of the ASDM dynamics, let us consider a random-walk model as shown in Fig. 3(a). 
Here, $R_k(t)$ ($k \in \{A, B\}$) is a reward at time $t$, and $K$ is a parameter (see Eq. (1)). We assume that the means of 
the probability density functions of $R_k$ satisfy $\mu_A > \mu_B$ for simplicity. After time step $t$, the displacement $D_k(t)$ 
($k \in \{A, B\}$) can be described by 

$$D_k(t) = \sum_{j=1}^{N_k(t)} R_k(j) - K N_k(t). \tag{8}$$

The expected value of $D_k$ can be obtained from the following equation: 

$$E(D_k(t)) = (\mu_k - K) N_k(t). \tag{9}$$

In the overlapping area between the two distributions shown in Fig. 3(b), we cannot accurately estimate which is 
larger. The overlapping area should decrease as $N_k$ increases so as to avoid incorrect judgments. This requirement 
can be expressed by the following forms: 

$$\mu_A - K > 0, \tag{10}$$

$$\mu_B - K < 0. \tag{11}$$
FIG. 3: (a) Random walk: flight $R_k(t) - K$. Here, the probability density function of $R_k$ has the mean $\mu_k$. (b) Probability 
distributions of two random walks. 


These expressions can be rearranged into the form 

$$\mu_B < K < \mu_A. \tag{12}$$

In other words, the parameter $K$ must satisfy the above condition so that the random walk correctly identifies the 
machine with the larger mean reward. 

We can easily confirm that the following form satisfies the above condition: 

$$K = \frac{\mu_A + \mu_B}{2}. \tag{13}$$

From $Q_k(t)$ and Eq. (13), we obtain 

$$K_0 = \frac{\gamma}{2}, \tag{14}$$

$$\gamma = \mu_A + \mu_B. \tag{15}$$

Here, we have set the parameter $K$ to $K_0$. Therefore, we can conclude that the ASDM dynamics using the learning 
rule $Q_k$ with the parameter $K_0$ can solve the MAB correctly. 
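As a small worked check of the condition in Eq. (12), the sketch below evaluates the expected displacement of Eq. (9) for the midpoint choice of K. The helper names are hypothetical, and the numerical values (means 0.5 and 0.6) are simply those quoted for the simulations; the condition is written symmetrically so that it does not matter which machine has the larger mean.

```python
def k0(mu_a, mu_b):
    """Midpoint parameter K_0 = (mu_A + mu_B) / 2, Eqs. (13)-(15)."""
    return (mu_a + mu_b) / 2


def satisfies_condition(K, mu_a, mu_b):
    """Eq. (12), written symmetrically: K lies strictly between the two means."""
    return min(mu_a, mu_b) < K < max(mu_a, mu_b)


def expected_displacement(mu_k, K, n_k):
    """E(D_k) = (mu_k - K) * N_k, Eq. (9)."""
    return (mu_k - K) * n_k


if __name__ == "__main__":
    K = k0(0.5, 0.6)
    print(K, satisfies_condition(K, 0.5, 0.6))
    # The better machine drifts upward, the worse one drifts downward.
    print(expected_displacement(0.6, K, 10), expected_displacement(0.5, K, 10))
```

With K between the two means, the two random walks drift in opposite directions, so their distributions separate as the number of plays grows.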


B. Origin of the high performance 

In many popular algorithms such as the ε-greedy algorithm, at each time $t$, an estimate of reward probability is 
updated only for the machine that was played. On the other hand, in an imaginary circumstance in which the 
sum of the mean rewards $\gamma = \mu_A + \mu_B$ is known to the player, we can update both of the two estimates simultaneously, 
even though only one of the machines was played. 


TABLE I: Estimates for each mean reward based on the knowledge that machine A was played $N_A$ times and that machine B 
was played $N_B$ times, on the assumption that the sum of the mean rewards $\gamma = \mu_A + \mu_B$ is known. 

Knowledge used | Estimate for A | Estimate for B
A played $N_A$ times | $\frac{\sum_{j=1}^{N_A} R_A(j)}{N_A}$ | $\gamma - \frac{\sum_{j=1}^{N_A} R_A(j)}{N_A}$
B played $N_B$ times | $\gamma - \frac{\sum_{j=1}^{N_B} R_B(j)}{N_B}$ | $\frac{\sum_{j=1}^{N_B} R_B(j)}{N_B}$

The top and bottom rows of Table I provide estimates based on the knowledge that machine A was played $N_A$ times 
and that machine B was played $N_B$ times, respectively. Note that we can also update the estimate of the machine 
that was not played, owing to the given $\gamma$. 
From the above estimates, each expected reward $Q'_k$ ($k \in \{A, B\}$) is given as follows: 

$$Q'_A = N_A \frac{\sum_{j=1}^{N_A} R_A(j)}{N_A} + N_B \left(\gamma - \frac{\sum_{j=1}^{N_B} R_B(j)}{N_B}\right) = \sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j) + \gamma N_B, \tag{16}$$

$$Q'_B = N_A \left(\gamma - \frac{\sum_{j=1}^{N_A} R_A(j)}{N_A}\right) + N_B \frac{\sum_{j=1}^{N_B} R_B(j)}{N_B} = \sum_{j=1}^{N_B} R_B(j) - \sum_{j=1}^{N_A} R_A(j) + \gamma N_A. \tag{17}$$


These expected rewards, $Q'_k$s, are not the same as those given by the learning rules of TOW dynamics, $Q_k$s in Eqs. (1) 
and (4). However, what we use substantially in TOW dynamics is the difference 

$$Q_A - Q_B = \left(\sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j)\right) - K (N_A - N_B). \tag{18}$$


When we transform the expected rewards $Q'_k$ into $Q''_k = Q'_k / 2$, we can obtain the difference 

$$Q''_A - Q''_B = \left(\sum_{j=1}^{N_A} R_A(j) - \sum_{j=1}^{N_B} R_B(j)\right) - \frac{\gamma}{2} (N_A - N_B). \tag{19}$$

Comparing the coefficients of Eqs. (18) and (19), the differences in their constituent terms are always equal when 
$K = K_0$ (Eq. (14)) is satisfied. Thus, we obtain the nearly optimal weighting parameter $K_0$ in terms of $\gamma$. 
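The equality of the two differences can be checked numerically. The sketch below evaluates the difference of Eq. (18) and the halved difference built from Eqs. (16) and (17) on arbitrary reward sequences; the function names and the sample data are assumptions made for illustration.

```python
import random


def tow_difference(rewards_a, rewards_b, K):
    """Q_A - Q_B of the TOW learning rule, Eq. (18)."""
    n_a, n_b = len(rewards_a), len(rewards_b)
    return (sum(rewards_a) - sum(rewards_b)) - K * (n_a - n_b)


def imaginary_difference(rewards_a, rewards_b, gamma):
    """Q''_A - Q''_B, Eq. (19), built from Q'_A and Q'_B of Eqs. (16) and (17)."""
    n_a, n_b = len(rewards_a), len(rewards_b)
    q_a = sum(rewards_a) - sum(rewards_b) + gamma * n_b   # Eq. (16)
    q_b = sum(rewards_b) - sum(rewards_a) + gamma * n_a   # Eq. (17)
    return (q_a - q_b) / 2


if __name__ == "__main__":
    rng = random.Random(0)
    ra = [rng.gauss(0.5, 0.2) for _ in range(7)]    # rewards seen on machine A
    rb = [rng.gauss(0.6, 0.2) for _ in range(12)]   # rewards seen on machine B
    gamma = 0.5 + 0.6
    # The two values agree when K = gamma / 2 and differ otherwise.
    print(tow_difference(ra, rb, gamma / 2), imaginary_difference(ra, rb, gamma))
    print(tow_difference(ra, rb, 0.3))
```

The agreement holds for any reward sequences, since it is an algebraic identity in the sums and play counts.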

This derivation implies that the learning rule for the ASDM dynamics is equivalent to that of the imaginary system 
in which both of the two estimates can be updated simultaneously. In other words, the ASDM dynamics imitates 
the imaginary system that determines its next move at time $t + 1$ by referring to the estimates of the two machines, 
even if one of them was not actually played at time t. This unique feature in the learning rule, derived from the fact 
that the sum of mean rewards is given in advance, may be one of the origins of the high performance of the ASDM 
dynamics. 

Monte Carlo simulations verified that the ASDM dynamics with $K_0$ exhibits an exceptionally 
high performance, which is comparable to its peak performance achieved with the optimal parameter $K_{\mathrm{opt}}$. To derive 
the optimal value $K_{\mathrm{opt}}$ accurately, we need to take into account the fluctuations. 

In addition, the essence of the process described here can be generalized to $M$-machine cases. To separate the 
distributions of the top $m$-th and top $(m+1)$-th machines, as shown in Fig. 3(b), all we need is the following $K_0$: 

$$K_0 = \frac{\gamma}{2}, \tag{20}$$

$$\gamma = \mu_{(m)} + \mu_{(m+1)}. \tag{21}$$

Here, $\mu_{(m)}$ denotes the top $m$-th mean, and $m$ is any integer from 1 to $M - 1$. The MAB is a special case where 
$m = 1$. In fact, for $M$-machine and $N$-player cases, we have designed a physical system that can determine the overall 
optimal state, called the “social maximum,” quickly and accurately [28], [29]. 


C. Performance characteristics 

To characterize the high performance of the ASDM dynamics, let us consider an imaginary model for solving the 
MAB, called the “cheater algorithm.” The cheater algorithm selects a machine to play according to the following 
estimates $S_k$ ($k \in \{A, B\}$): 

$$S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N}, \tag{22}$$

$$S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N}. \tag{23}$$
Here, $X_{k,i}$ is a random variable. If $S_A > S_B$ at time $t = N$, machine A is played at time $t = N + 1$. If $S_B > S_A$ at 
time $t = N$, machine B is played at time $t = N + 1$. If $S_A = S_B$ at time $t = N$, a machine is played randomly at 
time $t = N + 1$. Note that the algorithm refers to the results of both machines at time $t$ without any attention to which 
machine was played at time $t - 1$. In other words, the algorithm “cheats” because it plays both machines and collects 
both results, but declares that it plays only one machine at a time. 

The expected value and the variance of $X_k$ are defined as $E(X_k) = \mu_k$ and $V(X_k) = \sigma_k^2$. Here, $\mu_k$ is the same 
as the mean reward defined earlier. From the central-limit theorem, $S_k$ has a Gaussian distribution with $E(S_k) = \mu_k N$ and 
$V(S_k) = \sigma_k^2 N$. If we define a new variable $S = S_A - S_B$, $S$ has a Gaussian distribution with the following 
values: 

$$E(S) = (\mu_A - \mu_B) N, \tag{24}$$

$$V(S) = (\sigma_A^2 + \sigma_B^2) N, \tag{25}$$

$$\sigma(S) = \sqrt{\sigma_A^2 + \sigma_B^2} \, \sqrt{N}. \tag{26}$$


FIG. 4: $Q(\phi \sqrt{N})$: probability of selecting the lower-reward machine in the cheater algorithm. 

From Fig. 4, the probability of playing machine B, which has a lower reward probability, can be described as 
$Q(E(S)/\sigma(S))$. Here, $Q(x)$ is a Q-function. We obtain 

$$P(t = N + 1, B) = Q(\phi \sqrt{N}). \tag{27}$$

Here, 

$$\phi = \frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}. \tag{28}$$


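Using the Chernoff bound $Q(x) \le \frac{1}{2} \exp(-\frac{x^2}{2})$, we can calculate the upper bound of a measure, called the “regret,” 
which quantifies the accumulated losses of the cheater algorithm: 

$$\mathrm{regret} = (\mu_A - \mu_B) E(N_B). \tag{29}$$

$$\begin{aligned}
E(N_B) &= \sum_{t=0}^{N-1} Q(\phi \sqrt{t}) \\
&\le \sum_{t=0}^{N-1} \frac{1}{2} \exp\left(-\frac{\phi^2}{2} t\right) \\
&= \frac{1}{2} + \sum_{t=1}^{N-1} \frac{1}{2} \exp\left(-\frac{\phi^2}{2} t\right) \\
&\le \frac{1}{2} + \int_0^{N-1} \frac{1}{2} \exp\left(-\frac{\phi^2}{2} t\right) dt \\
&= \frac{1}{2} - \frac{1}{\phi^2} \left(\exp\left(-\frac{\phi^2}{2} (N-1)\right) - 1\right)
\end{aligned} \tag{30}$$

$$\to \frac{1}{2} + \frac{1}{\phi^2} \quad (N \to \infty). \tag{31}$$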

Note that the regret becomes constant as N increases. 
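The chain of bounds in Eqs. (29)-(31) can be evaluated numerically. The sketch below computes the sum of Q-function values directly, using the standard identity Q(x) = erfc(x/√2)/2, and compares it with the finite-N bound and its limit. The function names are hypothetical, and the parameter values (means 0.6 and 0.5, standard deviation 0.2 for both machines) are chosen here only as an example.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def expected_wrong_plays(phi, n):
    """E(N_B) = sum_{t=0}^{N-1} Q(phi * sqrt(t)), first line of Eq. (30)."""
    return sum(q_function(phi * math.sqrt(t)) for t in range(n))


def finite_bound(phi, n):
    """Last line of Eq. (30): 1/2 - (exp(-phi^2 (N-1) / 2) - 1) / phi^2."""
    return 0.5 - (math.exp(-phi ** 2 * (n - 1) / 2) - 1) / phi ** 2


def limit_bound(phi):
    """Eq. (31): limit of the bound as N grows, 1/2 + 1/phi^2."""
    return 0.5 + 1 / phi ** 2


if __name__ == "__main__":
    mu_a, mu_b, sigma = 0.6, 0.5, 0.2                 # example values (assumed)
    phi = (mu_a - mu_b) / math.sqrt(2 * sigma ** 2)   # Eq. (28)
    n = 1000
    e_nb = expected_wrong_plays(phi, n)
    print(e_nb, finite_bound(phi, n), limit_bound(phi))
    print("regret bound:", (mu_a - mu_b) * limit_bound(phi))   # Eq. (29)
```

For these values the limit of the bound is 1/2 + 8 = 8.5 plays of the lower-reward machine, independent of N.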

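Using the “cheated” results, we can also calculate the regret of the ASDM dynamics in the same way. In this case, 

$$S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N_A} - K N_A, \tag{32}$$

$$S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N_B} - K N_B. \tag{33}$$

Here, $X_{k,i}$ is also a random variable. Then, we obtain 

$$E(S_k) = (\mu_k - K) N_k, \tag{34}$$

$$V(S_k) = \sigma_k^2 N_k. \tag{35}$$

Using the new variables $S = S_A - S_B$, $N = N_A + N_B$, and $D = N_A - N_B$, we also obtain 

$$E(S) = \frac{\mu_A - \mu_B}{2} N + \left(\frac{\mu_A + \mu_B}{2} - K\right) D, \tag{36}$$

$$V(S) = \frac{\sigma_A^2 + \sigma_B^2}{2} N + \frac{\sigma_A^2 - \sigma_B^2}{2} D. \tag{37}$$

If the conditions $K = K_0$ and $\sigma_A = \sigma_B \equiv \sigma$ are satisfied, we then obtain 

$$E(S) = \frac{\mu_A - \mu_B}{2} N, \tag{38}$$

$$V(S) = \sigma^2 N, \tag{39}$$

and 

$$P(t = N + 1, B) = Q(\phi_0 \sqrt{N}). \tag{40}$$

Here, 

$$\phi_0 = \frac{\mu_A - \mu_B}{2 \sigma}. \tag{41}$$

We can then calculate the upper bound of the regret for the ASDM dynamics: 

$$E(N_B) = \sum_{t=0}^{N-1} Q(\phi_0 \sqrt{t}) \le \frac{1}{2} - \frac{1}{\phi_0^2} \left(\exp\left(-\frac{\phi_0^2}{2} (N-1)\right) - 1\right) \tag{42}$$

$$\to \frac{1}{2} + \frac{1}{\phi_0^2} \quad (N \to \infty). \tag{43}$$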

Note that the regret for the ASDM dynamics also becomes constant as N increases. 
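The moments of Eqs. (36) and (37) can be checked against the per-machine expressions of Eqs. (34) and (35), and the limiting bound of Eq. (43) can be evaluated for concrete numbers. The sketch below uses hypothetical function names and example values (means 0.6 and 0.5, standard deviation 0.2) that are assumptions made for illustration.

```python
def asdm_moments(mu_a, mu_b, sigma_a, sigma_b, K, n_a, n_b):
    """Return (E(S), V(S)) for S = S_A - S_B using Eqs. (36) and (37)."""
    n, d = n_a + n_b, n_a - n_b
    mean = (mu_a - mu_b) / 2 * n + ((mu_a + mu_b) / 2 - K) * d
    var = (sigma_a ** 2 + sigma_b ** 2) / 2 * n + (sigma_a ** 2 - sigma_b ** 2) / 2 * d
    return mean, var


def asdm_limit_bound(mu_a, mu_b, sigma):
    """Eq. (43): 1/2 + 1/phi_0^2 with phi_0 = (mu_A - mu_B) / (2 sigma), Eq. (41)."""
    phi0 = (mu_a - mu_b) / (2 * sigma)
    return 0.5 + 1 / phi0 ** 2


if __name__ == "__main__":
    # General K: Eq. (36) agrees with (mu_A - K) N_A - (mu_B - K) N_B from Eq. (34).
    print(asdm_moments(0.6, 0.5, 0.2, 0.3, K=0.4, n_a=30, n_b=12))
    # K = K_0 and equal variances: the moments depend only on N = N_A + N_B.
    print(asdm_moments(0.6, 0.5, 0.2, 0.2, K=0.55, n_a=30, n_b=12))
    print(asdm_moments(0.6, 0.5, 0.2, 0.2, K=0.55, n_a=5, n_b=37))
    print(asdm_limit_bound(0.6, 0.5, 0.2))
```

With K = K_0 the term proportional to D vanishes, which is why the error probability in Eq. (40) depends only on the total number of plays.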

It is known that optimal algorithms for the MAB, defined by Auer et al., have a regret proportional to $\log(N)$. 

The regret has no finite upper bound as N increases because it continues to require playing the lower-reward machine 
to ensure that the probability of incorrect judgment goes to zero. A constant regret means that the probability 
of incorrect judgment remains non-zero in the ASDM dynamics, although this probability is nearly equal to zero. 
However, it would appear that the reward probabilities change frequently in actual decision-making situations, and 
their long-term behavior is not crucial for many practical purposes. For this reason, the ASDM dynamics would be 
more suited to real-world applications. 


V. CONCLUSION AND DISCUSSION 

In this paper, we proposed an ASDM for solving MAB problems, and analytically validated that its high efficiency 
in making a series of decisions for maximizing the total sum of stochastically obtained rewards is embedded in a 
volume-conserving physical system when subjected to suitable operations involving fluctuations. In conventional 
decision-making algorithms for solving MAB problems, the parameter for adjusting the “exploration time” must be 
optimized. This exploration parameter often reflects the difference between the rewarded experiences, i.e., $|\mu_A - \mu_B|$. 
In contrast, the ASDM demonstrates that higher performance can be achieved by introducing a parameter $K_0$ that 
refers to the sum of the rewarded experiences, i.e., $\mu_A + \mu_B$. This type of optimization, using the sum of the rewarded 
experiences, is particularly useful for time varying environments (reward probability or reward PDF) M- Owing to 
this novelty, the high performance of the TOW dynamics can be reproduced when implementing these dynamics with 
atomic switches. 

The ASDM proposed in this paper is a simple “ideal model.” While the assumptions used for constructing the model 
may contain some points that do not match real experimental situations, we can more accurately extend the model so 
that the modified assumptions do match real experimental situations. As long as the TOW dynamics between atomic 
switches is implemented, high performance decision-making can be guaranteed even in the extended model. 

The ASDM will introduce a new physics-based analog computing paradigm, which will include such things as 
“intelligent nanodevices” and “intelligent information networks” based on self-detection and self-judgment. Thus, our 
proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for 
understanding the biological information-processing principles that exploit their underlying physics. 